Main Question
Our main question:
- How does the inclusion of different numbers of age groups influence the results of an analysis of right to carry laws and violence rates?
Disclaimer: The purpose of the Open Case Studies project is to demonstrate the use of various data science methods, tools, and software in the context of messy, real-world data. A given case study does not cover all aspects of the research process, is not claiming to be the most appropriate way to analyze a given data set, and should not be used in the context of making policy decisions without external consultation from scientific experts.
This work is licensed under the Creative Commons Attribution-NonCommercial 3.0 (CC BY-NC 3.0) United States License.
This case study will introduce the topic of multicolinearity. We will do so by showcasing a real world example where multicolinearity in part resulted in historically contriversial and conflicting findings about the influence of the adoption of right-to-carry (RTC) concealed handgun laws on violent crime rates in the United States.
We will focus on two articles:
This has been a controversial topic as many other articles also had conflicting results. See here for a list of studies.
The Donohue, et al. article discusses how there are many other important methodolical aspects besides multicolinearity that could account for the historically conflicting results in these previous papers.
In fact, nearly every aspect of the data analysis process was different between the Donohue, et al. analysis and the Lott and Mustard analysis.
However, we will focus particularly on multicolinearity and we will explore how it can influence linear regression analyses and result in different conclusions.
This analysis will demonstrate how methodological details can be critically influential for our overall conclusions and can result in important policy related consequences. This article will provide a basis for the motivation.
John J. Donohue et al., Right‐to‐Carry Laws and Violent Crime: A Comprehensive Assessment Using Panel Data and a State‐Level Synthetic Control Analysis. Journal of Empirical Legal Studies, 16,2 (2019).
David B. Mustard & John Lott. Crime, Deterrence, and Right-to-Carry Concealed Handguns. Coase-Sandor Institute for Law & Economics Working Paper No. 41, (1996).
Here you can see the differences in the data used in the featured RTC articles:
We will perform analyses similar to those in these articles, however we will not try to recreate them, instead we will simplify our analysis to allow us to focus on multicolinearity.
Therefore we will use a subset of the listed explanatory variables and they will be consistent for both analyses that we will perform, with the exception that one analysis will have 6 demographic variables like the analysis in the Donohue, et al. article and the other will have 36 demogrpahic variables like the analysis in the Lott and Mustard article.
Our main question:
Statistical Learning Objectives:
In this case study, students will learn:
1) what multicolinearity is and how it can influence linear regression coefficients
2) how to look for the presence of multicolinarity
3) the difference between multicolinearity and correlation
Data science Learning Objectives:
We will especially focus on using packages and functions from the Tidyverse, such as dplyr and ggplot2. The tidyverse is a library of packages created by RStudio. While some students may be familiar with previous R programming packages, these packages make data science in R especially efficient.
So what exactly is a right-to-carry law?
It is a law thatspecifies if and how citizens are allowed to have a firearm on their person or nearby (for example in the citizen’s car) in public.
The Second Amendment to the United States Constitution guarantees the right to “keep and bear arms”. The amendment was ratified in 1791 as part of the Bill of Rights.
However, there are no federal laws about carrying firearms in public.
These laws are created and enforced at the state level. Sates vary greatly in their laws about the right to carry firearms. Some require extensive effort to obtain a permit to legally carry a firearm, while other states require very minimal effort to legally carry a firearm.
According to Wikipedia about the history of right-to-carry policies in the United States:
Public perception on concealed carry vs open carry has largely flipped. In the early days of the United States, open carrying of firearms, long guns and revolvers was a common and well-accepted practice. Seeing guns carried openly was not considered to be any cause for alarm. Therefore, anyone who would carry a firearm but attempt to conceal it was considered to have something to hide, and presumed to be a criminal. For this reason, concealed carry was denounced as a detestable practice in the early days of the United States.
Concealed weapons bans were passed in Kentucky and Louisiana in 1813. (In those days open carry of weapons for self-defense was considered acceptable; concealed carry was denounced as the practice of criminals.) By 1859, Indiana, Tennessee, Virginia, Alabama, and Ohio had followed suit. By the end of the nineteenth century, similar laws were passed in places such as Texas, Florida, and Oklahoma, which protected some gun rights in their state constitutions. Before the mid 1900s, most U.S. states had passed concealed carry laws rather than banning weapons completely. Until the late 1990s, many Southern states were either “No-Issue” or “Restrictive May-Issue”. Since then, these states have largely enacted “Shall-Issue” licensing laws, with numerous states legalizing “Unrestricted concealed carry”.
See here for more information.
Here are the general categories of Right to Carry Laws:
You can see that none of the fifty states have no-issue laws currently, meaning that all states allow the right to carry firearms at least in some way, however the level of restrictions is dramatically different from one state to another.
Here you can see how these laws have changed over time around the country:
There is variation from state to state even within the same general category:
For example here are the current carry laws in Idaho which is considered an “Unrestricted - no permit required” state:
Idaho permits the open carrying of firearms.
Idaho law permits both residents and non-residents who are at least 18 years old to carry concealed weapons, without a carry license, outside the limits of or confines of any city, provided the person is not otherwise disqualified from being issued a license to carry.
A person may also carry concealed weapons on or about his or her person, without a license, in the person’s own place of abode or fixed place of business, on property in which the person has any ownership or leasehold interest, or on private property where the person has permission to carry from any person who has an ownership or leasehold interest in that property.
State law also allows any resident of Idaho or a current member of the armed forces of the United States to carry a concealed handgun without a license to carry, provided the person is over 18 years old and not disqualified from being issued a license to carry concealed weapons under state law. An amendment to state law that takes effect on July 1, 2020 changes the reference in the above law from “a resident of Idaho” to “any citizen of the United States.”
And here are the current carry laws in Arizona which is also considered an “Unrestricted- - no permit required” state:
Arizona respects the right of law abiding citizens to openly carry a handgun.
Any person 21 years of age or older, who is not prohibited possessor, may carry a weapon openly or concealed without the need for a license. Any person carrying without a license must acknowledge and comply with the demands of a law enforcement officer when asked if he/she is carrying a concealed deadly weapon, if the officer has initiated an “investigation” such as a traffic stop.
Notice that citizens in Idaho only need to be 18 to carry a firearm, whereas they must be 21 in Arizona.
In contrast here is an example of current carry laws in Maryland which is considered a “Rights Restricted-Very Limited Issue” state:
Carrying and Transportation in Vehicles It is unlawful for any person without a permit to wear or carry a handgun, openly or concealed, upon or about his person. It is also unlawful for any person to knowingly transport a handgun in any vehicle traveling on public roads, highways, waterways or airways, or upon roads or parking lots generally used by the public. This does not apply to any person wearing, carrying or transporting a handgun within the confines of real estate owned or leased by him, or on which he resides, or within the confines of a business establishment owned or leased by him.
Permit To Carry Application for a permit to carry a handgun is made to the Secretary of State Police. In addition to the printed application form, the applicant should submit a notarized letter stating the reasons why he is applying for a permit.
avocado….Right to carry and covid masks?
There are some important considerations regarding this data analysis to keep in mind:
We do not use all of the data used by either the Lott and Mustard or Donohue, et al. analyses, nor do we perform the same analysis of each article. We instead perform a much simpler analysis with less variables for the purposes of illustration of the concept of multicollinearity and its influence on regression coefficients, not to reproduce either analysis.
Because our analysis is an oversimplification, our analysis should not be used for determining policy changes, instead we suggest that users consult with a specialist.
We would also like to note that…AVOCADO It is important that we do not treat race as an objective measure. Despite this, it can be used to advance scientific inquiry. For more information on this topic, we have included a link to a paper on the use of race as a measure in epidemiology.
We will begin by loading the packages that we will need:
library(here)
library(car) # vif function
library(plm) # fixed effect model, linear regression
library(broom) # tidy output
library(tidyverse) # general wrangling functions
library(pdftools) # read data from pdf
library(readxl) # importing excel sheets
library(cowplot) # to produce plot of plots
library(GGally)
library(ggrepel)
library(scales)
library(latex2exp)
library(viridis)
library(ggcorrplot)
library(rsample)
set.seed(999)| Package | Use |
|---|---|
| here | to easily load and save data |
| readr | to import the CSV file data |
| [car] | to calculate vif values |
The first time we use a function, we will use the :: to indicate which package we are using. Unless we have overlapping function names, this is not necessary, but we will include it here to be informative about where the functions we will use come from.
Below is a table from the Donohue, et al. paper that shows the data used in both analyses, where DAW stands for Donohue, et al. and LM stands for Lott and Mustard.
We will be using a subset of these variables, which are highlighted in green:
##State FIPS codes Avocado why do we need State FIPS?
The following data was downloaded from the US Census Bureau.
To import the data we will use the read_xls() function of the readxl package. Since the first five lines of this excel is information about the source of the data and when it was released, we need to skip importing these lines using the skip argument so that the data has the same number of columns for each row.
To obtain information about age, sex, and race, and overall population we will use US Census Bureau data, just like both of the articles. The cesnus data is available for different time spans. Here are the links for the years used in our analysis. We will use data from 1977 to 2010.
| Data | Link |
|---|---|
| years 1977 to 1979 | link |
| years 1980 to 1989 | link * county data was used for this decade |
| years 1990 to 1999 | link |
| years 2000 to 2010 | link technical documentation |
To import the data we will use the read_csv() function of the readr package for the csv files. In some decades, there are separate files for each year, we will read each of these together using the base list.files() function to get all of the names for each file and then the map() function of the purrr package to apply the read_csv() function on all of the file paths in the list created by list.files(). For years that are txt files we will use read_table2() also fo the readr package. The read_table2() function, unlike the read_table(), allows for any number of whitespace characters between columns, and the lines can be of different lengths.
AVOCADO I am a bit confused about the last decade… it’s only one file but it seems to need map…
dem_77_79 <- read_csv("docs/Demographics/Decade_1970/pe-19.csv", skip = 5)
dem_80_89 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1980/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(., skip=5))
dem_90_99 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_1990/",
pattern = "*.txt",
full.names = TRUE) %>%
map(~read_table2(., skip = 14))
dem_90_99 <- dem_90_99 %>%
map_df(bind_rows)
dem_00_10_2 <- read_csv("docs/Demographics/Decade_2000/st-est00int-alldata.csv")
dem_00_10 <- list.files(recursive = TRUE,
path = "docs/Demographics/Decade_2000/",
pattern = "*.csv",
full.names = TRUE) %>%
map(~read_csv(.))The following data was downloaded from the Federal Bureau of Investigation.
The read_csv() function of the readr package guesses what the class is for each variable, but sometimes it makes mistakes. It is good to specify the class for variables if you know them. We know that we want the variables about male and female counts to be numberic. We can specify that using the col_types = argument. See here and here for more information. One
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv")
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = "n",
female_total_ct = "n"))
ps_data <- read_csv("docs/Police_staffing/pe_1960_2018.csv",
col_types = cols(male_total_ct = col_double(),
female_total_ct = col_double()))https://data.bls.gov/cgi-bin/dsrv?la
ue_rate_data <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(., skip = 10))
ue_rate_names <- list.files(recursive = TRUE,
path = "docs/Unemployment",
pattern = "*.xlsx",
full.names = TRUE) %>%
map(~read_xlsx(.)) %>%
sapply(., "[",7,2, drop=TRUE)#Extracted from Table 21 from [US Census Bureau](https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-poverty-people.html)
#**persistent warning from unknown origin** https://community.rstudio.com/t/persistent-unknown-or-uninitialised-column-warnings/64879
#solution to above is alledgedly: "In any case the suggested approach is to initialize the column"
poverty_rate_data <- read_xls("docs/Poverty/hstpov21.xls", skip=2) #This may cause initialization issue, not easily reproducible (even after restarting R)Violent crime data was obtained from here
# A tibble: 6 x 4
Region Division `State\n(FIPS)` Name
<chr> <chr> <chr> <chr>
1 1 0 00 Northeast Region
2 1 1 00 New England Division
3 1 1 09 Connecticut
4 1 1 23 Maine
5 1 1 25 Massachusetts
6 1 1 33 New Hampshire
[1] "character"
# A tibble: 6 x 22
`Year of Estima… `FIPS State Cod… `State Name` `Race/Sex Indic…
<dbl> <chr> <chr> <chr>
1 1970 01 Alabama White male
2 1970 01 Alabama White female
3 1970 01 Alabama Black male
4 1970 01 Alabama Black female
5 1970 01 Alabama Other races male
6 1970 01 Alabama Other races fem…
# … with 18 more variables: `Under 5 years` <dbl>, `5 to 9 years` <dbl>, `10 to
# 14 years` <dbl>, `15 to 19 years` <dbl>, `20 to 24 years` <dbl>, `25 to 29
# years` <dbl>, `30 to 34 years` <dbl>, `35 to 39 years` <dbl>, `40 to 44
# years` <dbl>, `45 to 49 years` <dbl>, `50 to 54 years` <dbl>, `55 to 59
# years` <dbl>, `60 to 64 years` <dbl>, `65 to 69 years` <dbl>, `70 to 74
# years` <dbl>, `75 to 79 years` <dbl>, `80 to 84 years` <dbl>, `85 years and
# over` <dbl>
[1] "Year of Estimate" "FIPS State Code" "State Name"
[4] "Race/Sex Indicator" "Under 5 years" "5 to 9 years"
[7] "10 to 14 years" "15 to 19 years" "20 to 24 years"
[10] "25 to 29 years" "30 to 34 years" "35 to 39 years"
[13] "40 to 44 years" "45 to 49 years" "50 to 54 years"
[16] "55 to 59 years" "60 to 64 years" "65 to 69 years"
[19] "70 to 74 years" "75 to 79 years" "80 to 84 years"
[22] "85 years and over"
[1] "numeric"
dem_77_79 <- dem_77_79 %>%
mutate(RACE = case_when(str_detect(`Race/Sex Indicator`,"Black") ~ "Black",
str_detect(`Race/Sex Indicator`,"White") ~ "White",
TRUE ~ "Other"),
SEX = case_when(str_detect(`Race/Sex Indicator`,"female") ~ "Female",
TRUE ~ "Male")) %>%
dplyr::select(-`Race/Sex Indicator`,-`FIPS State Code`)
dem_77_79 <- dem_77_79 %>%
rename("YEAR"=`Year of Estimate`,
"STATE"=`State Name`) %>%
filter(YEAR %in% 1977:1979)
dem_77_79 <- dem_77_79 %>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP")
colnames(dem_77_79)[1] "YEAR" "STATE" "RACE" "SEX" "AGE_GROUP" "SUB_POP"
pop_77_79 <- dem_77_79 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
colnames(pop_77_79)[1] "YEAR" "STATE" "TOT_POP"
Year of Estimate FIPS State and County Codes
"numeric" "character"
Race/Sex Indicator Under 5 years
"character" "numeric"
5 to 9 years 10 to 14 years
"numeric" "numeric"
15 to 19 years 20 to 24 years
"numeric" "numeric"
25 to 29 years 30 to 34 years
"numeric" "numeric"
35 to 39 years 40 to 44 years
"numeric" "numeric"
45 to 49 years 50 to 54 years
"numeric" "numeric"
55 to 59 years 60 to 64 years
"numeric" "numeric"
65 to 69 years 70 to 74 years
"numeric" "numeric"
75 to 79 years 80 to 84 years
"numeric" "numeric"
85 years and over
"numeric"
dem_80_89 <- dem_80_89 %>%
mutate(RACE = case_when(str_detect(`Race/Sex Indicator`,"Black") ~ "Black",
str_detect(`Race/Sex Indicator`,"White") ~ "White",
TRUE ~ "Other"),
SEX = case_when(str_detect(`Race/Sex Indicator`,"female") ~ "Female",
TRUE ~ "Male")) %>%
dplyr::select(-`Race/Sex Indicator`)
colnames(dem_80_89) [1] "Year of Estimate" "FIPS State and County Codes"
[3] "Under 5 years" "5 to 9 years"
[5] "10 to 14 years" "15 to 19 years"
[7] "20 to 24 years" "25 to 29 years"
[9] "30 to 34 years" "35 to 39 years"
[11] "40 to 44 years" "45 to 49 years"
[13] "50 to 54 years" "55 to 59 years"
[15] "60 to 64 years" "65 to 69 years"
[17] "70 to 74 years" "75 to 79 years"
[19] "80 to 84 years" "85 years and over"
[21] "RACE" "SEX"
dem_80_89 <- dem_80_89 %>%
rename("YEAR"=`Year of Estimate`,
"STATEFP_temp"=`FIPS State and County Codes`) %>%
mutate(STATEFP = substr(STATEFP_temp, start = 1, stop = 2)) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
dem_80_89 <- dem_80_89 %>%
pivot_longer(cols=contains("years"),
names_to = "AGE_GROUP",
values_to = "SUB_POP_temp") %>%
group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
summarise(SUB_POP = sum(SUB_POP_temp), .groups="drop")
colnames(dem_80_89)[1] "YEAR" "STATE" "AGE_GROUP" "SEX" "RACE" "SUB_POP"
pop_80_89 <- dem_80_89 %>%
group_by(YEAR, STATE) %>%
summarise("TOT_POP" = sum(SUB_POP), .groups = "drop")
colnames(pop_80_89)[1] "YEAR" "STATE" "TOT_POP"
[1] "Year" "e" "Age" "Male" "Female" "Male_1"
[7] "Female_1" "Male_2" "Female_2" "Male_3" "Female_3" "Male_4"
[13] "Female_4" "Male_5" "Female_5" "Male_6" "Female_6" "Male_7"
[19] "Female_7"
# A tibble: 6 x 19
Year e Age Male Female Male_1 Female_1 Male_2 Female_2 Male_3 Female_3
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA NA
2 1990 01 0 20406 19101 9794 9414 103 90 192 170
3 1990 01 1 19393 18114 9475 9247 87 93 146 182
4 1990 01 2 18990 18043 9097 8837 97 100 175 160
5 1990 01 3 19246 17786 9002 8701 94 115 150 157
6 1990 01 4 19502 18366 9076 8989 108 114 168 178
# … with 8 more variables: Male_4 <dbl>, Female_4 <dbl>, Male_5 <dbl>,
# Female_5 <dbl>, Male_6 <dbl>, Female_6 <dbl>, Male_7 <dbl>, Female_7 <dbl>
colnames(dem_90_99) <- c("YEAR",
"STATEFP",
"Age",
"NH_W_M",
"NH_W_F",
"NH_B_M",
"NH_B_F",
"NH_AIAN_M",
"NH_AIAN_F",
"NH_API_M",
"NH_API_F",
"H_W_M",
"H_W_F",
"H_B_M",
"H_B_F",
"H_AIAN_M",
"H_AIAN_F",
"H_API_M",
"H_API_F")
dim(dem_90_99)[1] 43870 19
dem_90_99 <- dem_90_99 %>%
mutate(W_M = NH_W_M + H_W_M,
W_F = NH_W_F + H_W_F,
B_M = NH_B_M + H_B_M,
B_F = NH_B_F + H_B_F,
AIAN_M = NH_AIAN_M + H_AIAN_M,
AIAN_F = NH_AIAN_F + H_AIAN_F,
API_M = NH_API_M + H_API_M,
API_F = NH_API_F + H_API_F,
n_na = rowSums(is.na(.))) %>%
dplyr::select(-starts_with("NH_"), -starts_with("H_"))
dem_90_99 %>%
group_by(n_na) %>%
tally()# A tibble: 2 x 2
n_na n
<dbl> <int>
1 0 43860
2 19 10
empty_rows_na <- dem_90_99 %>%
group_by(n_na) %>%
tally() %>%
filter(n_na != 0) %>%
pull(n_na)
dem_90_99 <- dem_90_99 %>%
filter(n_na != empty_rows_na) %>%
dplyr::select(-n_na)
sapply(dem_90_99, class) YEAR STATEFP Age W_M W_F B_M
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
B_F AIAN_M AIAN_F API_M API_F
"numeric" "numeric" "numeric" "numeric" "numeric"
10 to 14 years 15 to 19 years 20 to 24 years 25 to 29 years
3060 3060 3060 3060
30 to 34 years 35 to 39 years 40 to 44 years 45 to 49 years
3060 3060 3060 3060
5 to 9 years 50 to 54 years 55 to 59 years 60 to 64 years
3060 3060 3060 3060
65 to 69 years 70 to 74 years 75 to 79 years 80 to 84 years
3060 3060 3060 3060
85 years and over Under 5 years
3060 3060
dem_90_99 <- dem_90_99 %>%
mutate(AGE_GROUP = cut(Age,
breaks = seq(0,90, by=5),
right = FALSE,
labels = c("Under 5 years",
"5 to 9 years",
"10 to 14 years",
"15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years",
"40 to 44 years",
"45 to 49 years",
"50 to 54 years",
"55 to 59 years",
"60 to 64 years",
"65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over")
)) %>%
dplyr::select(-Age) %>%
mutate(AGE_GROUP = as.character(AGE_GROUP))
sapply(dem_90_99, class) YEAR STATEFP W_M W_F B_M B_F
"numeric" "character" "numeric" "numeric" "numeric" "numeric"
AIAN_M AIAN_F API_M API_F AGE_GROUP
"numeric" "numeric" "numeric" "numeric" "character"
dem_90_99 <- dem_90_99 %>%
group_by(YEAR, STATEFP, AGE_GROUP) %>%
summarise_at(vars(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")), sum) %>%
ungroup() %>%
pivot_longer(cols = c(starts_with("W_"),
starts_with("B_"),
starts_with("AIAN_"),
starts_with("API_")),
names_to = "RACE",
values_to = "SUB_POP")
dem_90_99 <- dem_90_99 %>%
mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
TRUE ~ "Female"),
RACE = case_when(str_detect(RACE, "W_") ~ "White",
str_detect(RACE, "B_") ~ "Black",
TRUE ~ "Other")) %>%
left_join(STATE_FIPS, by = "STATEFP") %>%
dplyr::select(-STATEFP)
pop_90_99 <- dem_90_99 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_90_99 <- dem_90_99 %>%
left_join(pop_90_99, by=c("YEAR", "STATE")) %>%
mutate(PERC_SUB_POP = SUB_POP*100/TOT_POP) %>%
dplyr::select(-SUB_POP, -TOT_POP) REGION DIVISION STATE NAME
"numeric" "numeric" "numeric" "character"
SEX ORIGIN RACE AGEGRP
"numeric" "numeric" "numeric" "numeric"
ESTIMATESBASE2000 POPESTIMATE2000 POPESTIMATE2001 POPESTIMATE2002
"numeric" "numeric" "numeric" "numeric"
POPESTIMATE2003 POPESTIMATE2004 POPESTIMATE2005 POPESTIMATE2006
"numeric" "numeric" "numeric" "numeric"
POPESTIMATE2007 POPESTIMATE2008 POPESTIMATE2009 CENSUS2010POP
"numeric" "numeric" "numeric" "numeric"
POPESTIMATE2010
"numeric"
dem_00_10 <- dem_00_10 %>%
dplyr::select(-ESTIMATESBASE2000,-CENSUS2010POP) %>%
filter(REGION != 0,
DIVISION != 0,
SEX != 0,
ORIGIN == 0,
RACE != 0,
AGEGRP != 0,
STATE != 0) %>%
dplyr::select(-REGION, -DIVISION, -ORIGIN, -STATE) %>%
rename("STATE"=NAME,
"AGE_GROUP"=AGEGRP) %>%
mutate(SEX = factor(SEX,
levels = 1:2,
labels = c("Male",
"Female")),
RACE = factor(RACE,
levels = 1:6,
labels = c("White",
"Black",
rep("Other",4))),
AGE_GROUP = factor(AGE_GROUP,
levels = 1:18,
labels = c("Under 5 years",
"5 to 9 years",
"10 to 14 years",
"15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years",
"40 to 44 years",
"45 to 49 years",
"50 to 54 years",
"55 to 59 years",
"60 to 64 years",
"65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(SEX = as.character(SEX),
RACE = as.character(RACE),
AGE_GROUP = as.character(AGE_GROUP))
colnames(dem_00_10) [1] "STATE" "SEX" "RACE" "AGE_GROUP"
[5] "POPESTIMATE2000" "POPESTIMATE2001" "POPESTIMATE2002" "POPESTIMATE2003"
[9] "POPESTIMATE2004" "POPESTIMATE2005" "POPESTIMATE2006" "POPESTIMATE2007"
[13] "POPESTIMATE2008" "POPESTIMATE2009" "POPESTIMATE2010"
dem_00_10 <- dem_00_10 %>%
pivot_longer(cols=contains("ESTIMATE"),
names_to = "YEAR",
values_to = "SUB_POP")
dem_00_10 <- dem_00_10 %>%
mutate(YEAR = str_sub(YEAR, start=-4)) %>%
mutate(YEAR = as.numeric(YEAR))
sapply(dem_00_10, class) STATE SEX RACE AGE_GROUP YEAR SUB_POP
"character" "character" "character" "character" "numeric" "numeric"
pop_00_10 <- dem_00_10 %>%
group_by(YEAR, STATE) %>%
summarise(TOT_POP = sum(SUB_POP), .groups = "drop")
dem_00_10 %>%
left_join(pop_00_10, by=c("YEAR", "STATE")) %>%
group_by(YEAR, STATE) %>%
mutate(PERC_SUB_POP = SUB_POP*100/TOT_POP) %>%
summarise(perc_tot = sum(PERC_SUB_POP), .groups = "drop") %>%
mutate(poss_error = case_when(abs(perc_tot - 100) > 0 ~ TRUE,
TRUE ~ FALSE)) %>%
group_by(poss_error) %>%
tally()# A tibble: 1 x 2
poss_error n
<lgl> <int>
1 FALSE 561
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 6
YEAR STATE RACE SEX AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama White Male Under 5 years 2.61
2 1977 Alabama White Male 5 to 9 years 3.00
3 1977 Alabama White Male 10 to 14 years 3.25
4 1977 Alabama White Male 15 to 19 years 3.58
5 1977 Alabama White Male 20 to 24 years 3.33
6 1977 Alabama White Male 25 to 29 years 2.95
# A tibble: 6 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years Female Black 1.28
2 1980 Alabama 10 to 14 years Female Other 0.0206
3 1980 Alabama 10 to 14 years Female White 2.80
4 1980 Alabama 10 to 14 years Male Black 1.30
5 1980 Alabama 10 to 14 years Male Other 0.0212
6 1980 Alabama 10 to 14 years Male White 2.97
# A tibble: 6 x 6
YEAR AGE_GROUP RACE SEX STATE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1990 10 to 14 years White Male Alabama 2.46
2 1990 10 to 14 years White Female Alabama 2.33
3 1990 10 to 14 years Black Male Alabama 1.21
4 1990 10 to 14 years Black Female Alabama 1.20
5 1990 10 to 14 years Other Male Alabama 0.0239
6 1990 10 to 14 years Other Female Alabama 0.0235
# A tibble: 6 x 6
STATE SEX RACE AGE_GROUP YEAR PERC_SUB_POP
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 Alabama Male White Under 5 years 2000 2.24
2 Alabama Male White Under 5 years 2001 2.24
3 Alabama Male White Under 5 years 2002 2.22
4 Alabama Male White Under 5 years 2003 2.21
5 Alabama Male White Under 5 years 2004 2.20
6 Alabama Male White Under 5 years 2005 2.20
[1] 18
[1] 18
[1] 18
[1] 18
dem <- bind_rows(dem_77_79,
dem_80_89,
dem_90_99,
dem_00_10)
dem %>%
filter(RACE == "Other") %>%
group_by(YEAR) %>%
tally() %>%
summarise(years_data = n())# A tibble: 1 x 1
years_data
<int>
1 34
[1] 34
DONOHUE_AGE_GROUPS <- c("15 to 19 years",
"20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years")
DONOHUE_RACE <- c("White",
"Black",
"Other")
DONOHUE_SEX <- c("Male")
dem_DONOHUE <- dem %>%
filter(AGE_GROUP %in% DONOHUE_AGE_GROUPS,
RACE %in% DONOHUE_RACE,
SEX %in% DONOHUE_SEX) %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP, "20 to 39 years"=c("20 to 24 years",
"25 to 29 years",
"30 to 34 years",
"35 to 39 years"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)
LOTT_AGE_GROUPS_NULL <- c("Under 5 years",
"5 to 9 years")
LOTT_RACE <- c("White",
"Black",
"Other")
LOTT_SEX <- c("Male",
"Female")
dem_LOTT <- dem %>%
filter(!(AGE_GROUP %in% LOTT_AGE_GROUPS_NULL),
RACE %in% LOTT_RACE,
SEX %in% LOTT_SEX) %>%
mutate(AGE_GROUP = fct_collapse(AGE_GROUP,
"10 to 19 years"=c("10 to 14 years",
"15 to 19 years"),
"20 to 29 years"=c("20 to 24 years",
"25 to 29 years"),
"30 to 39 years"=c("30 to 34 years",
"35 to 39 years"),
"40 to 49 years"=c("40 to 44 years",
"45 to 49 years"),
"50 to 64 years"=c("50 to 54 years",
"55 to 59 years",
"60 to 64 years"),
"65 years and over"=c("65 to 69 years",
"70 to 74 years",
"75 to 79 years",
"80 to 84 years",
"85 years and over"))) %>%
mutate(AGE_GROUP = str_replace_all(AGE_GROUP," ","_")) %>%
group_by(YEAR, STATE, RACE, SEX, AGE_GROUP) %>%
summarise(PERC_SUB_POP = sum(PERC_SUB_POP), .groups = "drop") %>%
unite(col = "VARIABLE", RACE, SEX, AGE_GROUP, sep = "_") %>%
rename("VALUE"=PERC_SUB_POP)
dim(expand.grid(c(1:6), c(7:8), c(9:10)))[1][1] 24
[1] TRUE
[1] TRUE
[1] TRUE
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1980 Alabama 3899671
2 1980 Alaska 404680
3 1980 Arizona 2735840
4 1980 Arkansas 2288809
5 1980 California 23792840
6 1980 Colorado 2909545
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1990 Alabama 4048508
2 1990 Alaska 553120
3 1990 Arizona 3679056
4 1990 Arkansas 2354343
5 1990 California 29950111
6 1990 Colorado 3303862
# A tibble: 6 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 2000 Alabama 4452173
2 2000 Alaska 627963
3 2000 Arizona 5160586
4 2000 Arkansas 2678588
5 2000 California 33987977
6 2000 Colorado 4326921
population_data <- bind_rows(pop_77_79,
pop_80_89,
pop_90_99,
pop_00_10)
population_data %>%
group_by(YEAR) %>%
tally() %>%
print(n = dim(.)[1])# A tibble: 34 x 2
YEAR n
<dbl> <int>
1 1977 51
2 1978 51
3 1979 51
4 1980 51
5 1981 51
6 1982 51
7 1983 51
8 1984 51
9 1985 51
10 1986 51
11 1987 51
12 1988 51
13 1989 51
14 1990 51
15 1991 51
16 1992 51
17 1993 51
18 1994 51
19 1995 51
20 1996 51
21 1997 51
22 1998 51
23 1999 51
24 2000 51
25 2001 51
26 2002 51
27 2003 51
28 2004 51
29 2005 51
30 2006 51
31 2007 51
32 2008 51
33 2009 51
34 2010 51
[1] "data_year" "ori" "pub_agency_name"
[4] "pub_agency_unit" "state_abbr" "division_name"
[7] "region_name" "county_name" "agency_type_name"
[10] "population_group_desc" "population" "male_officer_ct"
[13] "male_civilian_ct" "male_total_ct" "female_officer_ct"
[16] "female_civilian_ct" "female_total_ct" "officer_ct"
[19] "civilian_ct" "total_pe_ct" "pe_ct_per_1000"
ps_data <- ps_data %>%
filter(data_year >= 1977,
data_year <= 2014) %>%
mutate(male_total_ct = case_when(is.na(male_total_ct) ~ 0,
TRUE ~ male_total_ct),
female_total_ct = case_when(is.na(female_total_ct) ~ 0,
TRUE ~ female_total_ct)) %>%
mutate(officer_total = male_total_ct + female_total_ct) %>%
dplyr::select(data_year,
pub_agency_name,
state_abbr,
officer_total)
ps_data <- ps_data %>%
group_by(data_year, state_abbr) %>%
summarise(officer_state_total=sum(officer_total), .groups = "drop")
ps_data %>%
group_by(state_abbr) %>%
tally() %>%
print(n = dim(.)[1])# A tibble: 59 x 2
state_abbr n
<chr> <int>
1 AK 38
2 AL 38
3 AR 38
4 AS 38
5 AZ 38
6 CA 38
7 CO 38
8 CT 38
9 CZ 38
10 DC 38
11 DE 38
12 FL 38
13 FS 38
14 GA 38
15 GM 38
16 HI 38
17 IA 38
18 ID 38
19 IL 38
20 IN 38
21 KS 38
22 KY 38
23 LA 38
24 MA 38
25 MD 38
26 ME 38
27 MI 38
28 MN 38
29 MO 38
30 MP 38
31 MS 38
32 MT 38
33 NB 38
34 NC 38
35 ND 38
36 NH 38
37 NJ 38
38 NM 38
39 NV 38
40 NY 38
41 OH 38
42 OK 38
43 OR 38
44 OT 38
45 PA 38
46 PR 38
47 RI 38
48 SC 38
49 SD 38
50 TN 38
51 TX 38
52 UT 38
53 VA 38
54 VI 38
55 VT 38
56 WA 38
57 WI 38
58 WV 38
59 WY 38
# NB is Nebraska. This was changed to NE to avoid confusions with NB in Canada. This dataset uses NB
state_of_interest_NULL <- c("AS",
"GM",
"CZ",
"FS",
"MP",
"OT",
"PR",
"VI")
state_abb_df <- as.data.frame(cbind(state.abb, state.name))
colnames(state_abb_df) <- c("state_abbr", "STATE")
print(state_abb_df) state_abbr STATE
1 AL Alabama
2 AK Alaska
3 AZ Arizona
4 AR Arkansas
5 CA California
6 CO Colorado
7 CT Connecticut
8 DE Delaware
9 FL Florida
10 GA Georgia
11 HI Hawaii
12 ID Idaho
13 IL Illinois
14 IN Indiana
15 IA Iowa
16 KS Kansas
17 KY Kentucky
18 LA Louisiana
19 ME Maine
20 MD Maryland
21 MA Massachusetts
22 MI Michigan
23 MN Minnesota
24 MS Mississippi
25 MO Missouri
26 MT Montana
27 NE Nebraska
28 NV Nevada
29 NH New Hampshire
30 NJ New Jersey
31 NM New Mexico
32 NY New York
33 NC North Carolina
34 ND North Dakota
35 OH Ohio
36 OK Oklahoma
37 OR Oregon
38 PA Pennsylvania
39 RI Rhode Island
40 SC South Carolina
41 SD South Dakota
42 TN Tennessee
43 TX Texas
44 UT Utah
45 VT Vermont
46 VA Virginia
47 WA Washington
48 WV West Virginia
49 WI Wisconsin
50 WY Wyoming
state_abb_df <- state_abb_df %>%
add_row(state_abbr="DC",
STATE="District of Columbia")
denominator_temp <- population_data %>%
dplyr::select(-VARIABLE) %>%
rename("Population_temp"=VALUE)
ps_data <- ps_data %>%
filter(!(state_abbr %in% state_of_interest_NULL)) %>%
mutate(state_abbr = case_when(state_abbr == "NB" ~ "NE",
TRUE ~ state_abbr)) %>%
left_join(state_abb_df, by = "state_abbr") %>%
dplyr::select(-state_abbr) %>%
rename(YEAR = "data_year",
VALUE = "officer_state_total") %>%
mutate(VARIABLE = "officer_state_total") %>%
left_join(denominator_temp, by=c("STATE","YEAR")) %>%
mutate(VALUE = (VALUE*100000) / Population_temp) %>%
mutate(VALUE = lag(VALUE)) %>%
mutate(VARIABLE = "police_per_100k_lag") %>%
dplyr::select(-Population_temp)# A tibble: 1 x 14
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2020 3.2 2.9 3 13.3 NA NA NA NA NA NA NA NA
# … with 1 more variable: Annual <dbl>
[1] "STATE" "Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun"
[9] "Jul" "Aug" "Sep" "Oct" "Nov" "Dec" "Annual"
STATE Year Jan Feb Mar Apr
"character" "numeric" "numeric" "numeric" "numeric" "numeric"
May Jun Jul Aug Sep Oct
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
Nov Dec Annual
"numeric" "numeric" "numeric"
# A tibble: 6 x 6
`NOTE: Number in thousa… ...2 ...3 ...4 ...5 ...6
<chr> <chr> <chr> <chr> <chr> <chr>
1 2018 <NA> <NA> <NA> <NA> <NA>
2 STATE Total Number "Standard\n… Percent "Standard\ner…
3 Alabama 4877 779 "65" 16 "1.3"
4 Alaska 720 94 "9" 13.1 "1.2"
5 Arizona 7241 929 "80" 12.80000000… "1.1000000000…
6 Arkansas 2912 462 "38" 15.9 "1.3"
colnames(poverty_rate_data) <- c("STATE",
"Total",
"Number",
"Number_se",
"Percent",
"Percent_se")
tail(poverty_rate_data)# A tibble: 6 x 6
STATE Total Number Number_se Percent Percent_se
<chr> <chr> <chr> <chr> <chr> <chr>
1 Wisconsin 4724 403 57 8.5 1.1000000000…
2 Wyoming 468 49 20 10.4 4
3 Standard errors shown in this ta… <NA> <NA> <NA> <NA> <NA>
4 For information on confidentiali… <NA> <NA> <NA> <NA> <NA>
5 Footnotes are available at <www.… <NA> <NA> <NA> <NA> <NA>
6 SOURCE: U.S. Bureau of the Censu… <NA> <NA> <NA> <NA> <NA>
notes <- 4
poverty_rate_data <- poverty_rate_data[-((dim(poverty_rate_data)[1]-notes+1):dim(poverty_rate_data)[1]),]
states_eq <- 51
extra_col <- 2
rep_rows <- states_eq + extra_col
groups <- (dim(poverty_rate_data)[1])/(rep_rows)
paste(groups - (2018-1980 + 1), "extra groups")[1] "2 extra groups"
poverty_rate_data$year_group <- rep(1:groups, each=rep_rows)
poverty_rate_data <- poverty_rate_data %>%
group_by(year_group) %>%
group_split()
head(poverty_rate_data[[1]])# A tibble: 6 x 7
STATE Total Number Number_se Percent Percent_se year_group
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 2018 <NA> <NA> <NA> <NA> <NA> 1
2 STATE Total Number "Standard\ner… Percent "Standard\nerro… 1
3 Alabama 4877 779 "65" 16 "1.3" 1
4 Alaska 720 94 "9" 13.1 "1.2" 1
5 Arizona 7241 929 "80" 12.8000000000… "1.100000000000… 1
6 Arkans… 2912 462 "38" 15.9 "1.3" 1
poverty_rate_data <- poverty_rate_data %>%
map(~mutate(.,
row_id = row_number())) %>%
map(~filter(.,row_id != 2)) %>%
map(~dplyr::select(.,-row_id))
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_replace_all(.,"[:space:]","_")
names(poverty_rate_data) <- poverty_rate_data_names
# Recall 2 extra groups.
# footnotes available at https://www.census.gov/topics/income-poverty/poverty/guidance/poverty-footnotes/cps-historic-footnotes.html
poverty_rate_data$`2017_(21)` <- NULL
poverty_rate_data$`2013_(19)` <- NULL
poverty_rate_data_names <- poverty_rate_data %>%
sapply(., "[",1,1, drop=TRUE) %>%
str_sub(., start = 1, end=4)
names(poverty_rate_data) <- poverty_rate_data_names
poverty_rate_data <- poverty_rate_data %>%
map_df(bind_rows, .id = "YEAR") %>%
dplyr::select(-year_group)
poverty_rate_data <- poverty_rate_data %>%
mutate(n_na = rowSums(is.na(.)))
# This shows that there is systematic missing values stemmingly *solely* from the rows without poverty data and only a label designating the year
poverty_rate_data %>%
group_by(n_na) %>%
tally()# A tibble: 2 x 2
n_na n
<dbl> <int>
1 0 1989
2 5 39
YEAR STATE Total Number Number_se Percent
"character" "character" "character" "character" "character" "character"
Percent_se n_na
"character" "numeric"
poverty_rate_data <- poverty_rate_data %>%
drop_na() %>%
dplyr::select(-Number,
-Number_se,
-Percent_se,
-n_na,
-Total) %>%
rename("VALUE"=Percent) %>%
mutate(VARIABLE = "Poverty_rate",
YEAR = as.numeric(YEAR),
VALUE = as.numeric(VALUE))
colnames(poverty_rate_data)[1] "YEAR" "STATE" "VALUE" "VARIABLE"
https://www.ucrdatatool.gov/Search/Crime/State/StatebyState.cfm
crime_data <- read_lines("docs/Crime/CrimeStatebyState.csv", skip = 2, skip_empty_rows = TRUE)
length(crime_data)[1] 2254
crime_data <- crime_data[-(2143:length(crime_data))]
x <- 2014-1977+1
rep_cycle <- 2 + 2 + x
rep_cycle_cut <- 2 + x
delete_rows <- c(seq(2,length(crime_data),rep_cycle),
seq(3,length(crime_data),rep_cycle))
crime_data <- crime_data[-delete_rows]
crime_data <- data.frame(cbind(crime_data, rep(1:(length(crime_data)/rep_cycle_cut),each=rep_cycle_cut)))
colnames(crime_data) <- c("String","STATE_GROUP")
crime_data <- crime_data %>%
group_by(STATE_GROUP) %>%
group_split()
columns_crime_data <- 8
crime_data <- crime_data %>%
map(~mutate(.,
State = case_when(str_detect(String, "Estimated crime in ") ~ substring(String, nchar("Estimated crime in ")+1)),
row_id = row_number())) %>%
map(~fill(., State)) %>%
map(~filter(.,row_id > 2)) %>%
map(~mutate(.,
String = paste0(String, ",", State))) %>%
map(~dplyr::select(.,String)) %>%
map(~str_split_fixed(.$String,",",columns_crime_data + 1)) %>%
map(~data.frame(.)) %>%
map(~rename(.,"YEAR"=X1,
"Extra_col1"=X2,
"VC"=X3,
"Extra_col2"=X4,
"Extra_col3"=X5,
"Extra_col4"=X6,
"Extra_col5"=X7,
"Extra_col6"=X8,
"STATE"=X9)) %>%
map(~dplyr::select(.,-contains("Extra_col"))) %>%
map(~.x %>% mutate_all(~trimws(.,which = "both"))) %>%
map_df(bind_rows)
sapply(crime_data, class) YEAR VC STATE
"character" "character" "character"
syn_control_paper_p_62 <- syn_control_paper[[62]]
p_62 <- syn_control_paper_p_62 %>%
strsplit("\n") %>%
unlist() %>%
as.data.frame() %>%
slice(-(1:2))
apply(p_62, 1, nchar) [1] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[20] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109 109
[39] 109 109 109 109 109 109 109 109 109 109 109 109 109 109 63
[1] " 60"
[1] 3 4 4 4 3 4 2 3 2 4 4 3 4 3 4 4 4 4 4 4 3 2 4 4 4 4 4 4 4 2 3 3 3 3 3 4 4 4
[39] 3 3 2 3 3 4 4 4 3 4 2 3 4 4
[1] 2 3 3 3 2 3 2 2 2 3 3 2 3 2 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 2 2 3 2 3 3 3 3 3
[39] 3 3 2 3 3 3 3 3 2 3 2 3 3 3
[1] 2 2 2 2 1 2 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2
[39] 2 2 1 2 2 2 2 2 2 2 1 2 2 2
[1] 1 0 0 0 1 0 1 1 1 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0
[39] 0 0 1 0 0 0 0 0 1 0 1 0 0 0
.
1 Alabama 1975 1975
2 Alaska 10/1/1994 0.252 1995
3 Arizona 7/17/1994 0.460 1995
4 Arkansas 7/27/1995 0.433 1996
5 California N/A 0
6 Colorado 5/17/2003 0.627 2003
apply(p_62, 1, str_count, "\\\\s{40,}")
1 1
2 0
3 0
4 0
5 1
6 0
p_62 <- p_62 %>%
apply(1,str_replace_all, "\\s{40,}", "|N/A|") %>%
str_replace_all("\\s{2,15}", "|") %>%
as.data.frame()
p_62 <- sapply(p_62$., str_split, "\\|{1,}")
sapply(p_62, nchar)$`|Alabama||1975|N/A|1975`
[1] 0 7 4 3 4
$`|Alaska||10/1/1994||0.252|||1995`
[1] 0 6 9 5 4
$`|Arizona| 7/17/1994||0.460|||1995`
[1] 0 7 10 5 4
$`|Arkansas| 7/27/1995||0.433|||1996`
[1] 0 8 10 5 4
$`|California||N/A|N/A|0`
[1] 0 10 3 3 1
$`|Colorado| 5/17/2003||0.627|||2003`
[1] 0 8 10 5 4
$`|Connecticut||1970|N/A|1970`
[1] 0 11 4 3 4
$`|Delaware||N/A|N/A|0`
[1] 0 8 3 3 1
$`District of Columbia|N/A|N/A|0`
[1] 20 3 3 1
$`|Florida| 10/1/1987||0.252|||1988`
[1] 0 7 10 5 4
$`|Georgia| 8/25/1989||0.353|||1990`
[1] 0 7 10 5 4
$`|Hawaii||N/A|N/A|0`
[1] 0 6 3 3 1
$`|Idaho||7/1/1990||0.504|||1990`
[1] 0 5 8 5 4
$`|Illinois| 1/5/2014|N/A|2014`
[1] 0 8 9 3 4
$`|Indiana| 1/15/1980||0.962|||1980`
[1] 0 7 10 5 4
$`|Iowa||1/1/2011||1.000|||2011`
[1] 0 4 8 5 4
$`|Kansas||1/1/2007||1.000|||2007`
[1] 0 6 8 5 4
$`|Kentucky| 10/1/1996||0.251|||1997`
[1] 0 8 10 5 4
$`|Louisiana|4/19/1996||0.702|||1996`
[1] 0 9 9 5 4
$`|Maine||9/19/1985||0.285|||1986`
[1] 0 5 9 5 4
$`|Maryland||N/A|N/A|0`
[1] 0 8 3 3 1
$`|Massachusetts||N/A|N/A|0`
[1] 0 13 3 3 1
$`|Michigan||7/1/2001||0.504|||2001`
[1] 0 8 8 5 4
$`|Minnesota| 5/28/2003||0.597|||2003`
[1] 0 9 10 5 4
$`|Mississippi|7/1/1990||0.504|||1990`
[1] 0 11 8 5 4
$`|Missouri| 2/26/2004||0.847|||2004`
[1] 0 8 10 5 4
$`|Montana||10/1/1991||0.252|||1992`
[1] 0 7 9 5 4
$`|Nebraska||1/1/2007||1.000|||2007`
[1] 0 8 8 5 4
$`|Nevada||10/1/1995||0.252|||1996`
[1] 0 6 9 5 4
$`|New Hampshire||1959|N/A|1959`
[1] 0 13 4 3 4
$`|New Jersey||N/A|N/A|0`
[1] 0 10 3 3 1
$`|New Mexico||1/1/2004||1.000|||2004`
[1] 0 10 8 5 4
$`|New York||N/A|N/A|0`
[1] 0 8 3 3 1
$`|North Carolina|12/1/1995||0.085|||1996`
[1] 0 14 9 5 4
$`|North Dakota| 8/1/1985||0.419|||1986`
[1] 0 12 9 5 4
$`|Ohio||4/8/2004||0.732|||2004`
[1] 0 4 8 5 4
$`|Oklahoma||1/1/1996||1.000|||1996`
[1] 0 8 8 5 4
$`|Oregon||1/1/1990||1.000|||1990`
[1] 0 6 8 5 4
$`|Pennsylvania|6/17/1989||0.542|||1989`
[1] 0 12 9 5 4
$`|Philadelphia|10/11/1995||0.225|||1996`
[1] 0 12 10 5 4
$`|Rhode Island||N/A|N/A|0`
[1] 0 12 3 3 1
$`|South Carolina|8/23/1996||0.358|||1997`
[1] 0 14 9 5 4
$`|South Dakota| 7/1/1985||0.504|||1985`
[1] 0 12 9 5 4
$`|Tennessee| 10/1/1996||0.251|||1997`
[1] 0 9 10 5 4
$`|Texas||1/1/1996||1.000|||1996`
[1] 0 5 8 5 4
$`|Utah||5/1/1995||0.671|||1995`
[1] 0 4 8 5 4
$`|Vermont||1970|N/A|1970`
[1] 0 7 4 3 4
$`|Virginia| 5/5/1995||0.660|||1995`
[1] 0 8 9 5 4
$`|Washington||1961|N/A|1961`
[1] 0 10 4 3 4
$`|West Virginia|7/7/1989||0.488|||1990`
[1] 0 13 8 5 4
$`|Wisconsin| 11/1/2011||0.167|||2012`
[1] 0 9 10 5 4
$`|Wyoming||10/1/1994||0.252|||1995`
[1] 0 7 9 5 4
p_62 <- lapply(p_62, function(x) x[nchar(x) > 0])
p_62 <- as.data.frame(do.call(rbind, p_62))
rownames(p_62) [1] "|Alabama||1975|N/A|1975"
[2] "|Alaska||10/1/1994||0.252|||1995"
[3] "|Arizona| 7/17/1994||0.460|||1995"
[4] "|Arkansas| 7/27/1995||0.433|||1996"
[5] "|California||N/A|N/A|0"
[6] "|Colorado| 5/17/2003||0.627|||2003"
[7] "|Connecticut||1970|N/A|1970"
[8] "|Delaware||N/A|N/A|0"
[9] "District of Columbia|N/A|N/A|0"
[10] "|Florida| 10/1/1987||0.252|||1988"
[11] "|Georgia| 8/25/1989||0.353|||1990"
[12] "|Hawaii||N/A|N/A|0"
[13] "|Idaho||7/1/1990||0.504|||1990"
[14] "|Illinois| 1/5/2014|N/A|2014"
[15] "|Indiana| 1/15/1980||0.962|||1980"
[16] "|Iowa||1/1/2011||1.000|||2011"
[17] "|Kansas||1/1/2007||1.000|||2007"
[18] "|Kentucky| 10/1/1996||0.251|||1997"
[19] "|Louisiana|4/19/1996||0.702|||1996"
[20] "|Maine||9/19/1985||0.285|||1986"
[21] "|Maryland||N/A|N/A|0"
[22] "|Massachusetts||N/A|N/A|0"
[23] "|Michigan||7/1/2001||0.504|||2001"
[24] "|Minnesota| 5/28/2003||0.597|||2003"
[25] "|Mississippi|7/1/1990||0.504|||1990"
[26] "|Missouri| 2/26/2004||0.847|||2004"
[27] "|Montana||10/1/1991||0.252|||1992"
[28] "|Nebraska||1/1/2007||1.000|||2007"
[29] "|Nevada||10/1/1995||0.252|||1996"
[30] "|New Hampshire||1959|N/A|1959"
[31] "|New Jersey||N/A|N/A|0"
[32] "|New Mexico||1/1/2004||1.000|||2004"
[33] "|New York||N/A|N/A|0"
[34] "|North Carolina|12/1/1995||0.085|||1996"
[35] "|North Dakota| 8/1/1985||0.419|||1986"
[36] "|Ohio||4/8/2004||0.732|||2004"
[37] "|Oklahoma||1/1/1996||1.000|||1996"
[38] "|Oregon||1/1/1990||1.000|||1990"
[39] "|Pennsylvania|6/17/1989||0.542|||1989"
[40] "|Philadelphia|10/11/1995||0.225|||1996"
[41] "|Rhode Island||N/A|N/A|0"
[42] "|South Carolina|8/23/1996||0.358|||1997"
[43] "|South Dakota| 7/1/1985||0.504|||1985"
[44] "|Tennessee| 10/1/1996||0.251|||1997"
[45] "|Texas||1/1/1996||1.000|||1996"
[46] "|Utah||5/1/1995||0.671|||1995"
[47] "|Vermont||1970|N/A|1970"
[48] "|Virginia| 5/5/1995||0.660|||1995"
[49] "|Washington||1961|N/A|1961"
[50] "|West Virginia|7/7/1989||0.488|||1990"
[51] "|Wisconsin| 11/1/2011||0.167|||2012"
[52] "|Wyoming||10/1/1994||0.252|||1995"
rownames(p_62) <- c()
colnames(p_62) <- c("STATE",
"E_Date_RTC",
"Frac_Yr_Eff_Yr_Pass",
"RTC_Date_SA")
sapply(p_62, class) STATE E_Date_RTC Frac_Yr_Eff_Yr_Pass RTC_Date_SA
"character" "character" "character" "character"
p_62 <- p_62 %>%
dplyr::select(STATE, RTC_Date_SA) %>%
rename("RTC_LAW_YEAR"=RTC_Date_SA) %>%
mutate(RTC_LAW_YEAR = as.numeric(RTC_LAW_YEAR)) %>%
mutate(RTC_LAW_YEAR = case_when(RTC_LAW_YEAR == 0 ~ Inf,
TRUE ~ RTC_LAW_YEAR))
sapply(p_62, class) STATE RTC_LAW_YEAR
"character" "numeric"
STATE RTC_LAW_YEAR
1 Alabama 1975
2 Alaska 1995
3 Arizona 1995
4 Arkansas 1996
5 California Inf
6 Colorado 2003
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "YEAR" "STATE" "VARIABLE" "VALUE"
[1] "STATE" "YEAR" "VALUE" "VARIABLE"
[1] "YEAR" "STATE" "VALUE" "VARIABLE"
[1] "YEAR" "VALUE" "STATE" "VARIABLE"
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Male_15_to_19_years 1.55
2 1977 Alabama Black_Male_20_to_39_years 3.04
3 1977 Alabama Other_Male_15_to_19_years 0.0178
4 1977 Alabama Other_Male_20_to_39_years 0.0642
5 1977 Alabama White_Male_15_to_19_years 3.58
6 1977 Alabama White_Male_20_to_39_years 11.1
# A tibble: 6 x 4
YEAR STATE VARIABLE VALUE
<dbl> <chr> <chr> <dbl>
1 1977 Alabama Black_Female_10_to_19_years 3.01
2 1977 Alabama Black_Female_20_to_29_years 2.33
3 1977 Alabama Black_Female_30_to_39_years 1.29
4 1977 Alabama Black_Female_40_to_49_years 1.18
5 1977 Alabama Black_Female_50_to_64_years 1.73
6 1977 Alabama Black_Female_65_years_and_over 1.58
# A tibble: 6 x 4
STATE YEAR VALUE VARIABLE
<chr> <dbl> <dbl> <chr>
1 Alabama 1977 7.3 Unemployment_rate
2 Alabama 1978 6.4 Unemployment_rate
3 Alabama 1979 7.2 Unemployment_rate
4 Alabama 1980 8.9 Unemployment_rate
5 Alabama 1981 10.6 Unemployment_rate
6 Alabama 1982 14.1 Unemployment_rate
# A tibble: 6 x 4
YEAR STATE VALUE VARIABLE
<dbl> <chr> <dbl> <chr>
1 2018 Alabama 16 Poverty_rate
2 2018 Alaska 13.1 Poverty_rate
3 2018 Arizona 12.8 Poverty_rate
4 2018 Arkansas 15.9 Poverty_rate
5 2018 California 11.9 Poverty_rate
6 2018 Colorado 9.1 Poverty_rate
# A tibble: 6 x 4
YEAR VALUE STATE VARIABLE
<dbl> <dbl> <chr> <chr>
1 1977 15293 Alabama Viol_crime_count
2 1978 15682 Alabama Viol_crime_count
3 1979 15578 Alabama Viol_crime_count
4 1980 17320 Alabama Viol_crime_count
5 1981 18423 Alabama Viol_crime_count
6 1982 17653 Alabama Viol_crime_count
DONOHUE_DF <- bind_rows(dem_DONOHUE,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
DONOHUE_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
DONOHUE_DF <- DONOHUE_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(DONOHUE_DF$STATE)) Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
DONOHUE_DF <- DONOHUE_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of of units that have incomplete data. This gives returns a dataframe with the most complete information.
summary(as.factor(DONOHUE_DF$STATE)) Alabama Alaska Arizona
31 31 31
Arkansas California Colorado
31 31 31
Connecticut District of Columbia Delaware
31 31 31
Florida Georgia Hawaii
31 31 31
Idaho Illinois Indiana
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Hampshire
31 31 31
New Jersey New Mexico New York
31 31 31
North Carolina North Dakota Ohio
31 31 31
Oklahoma Oregon Pennsylvania
31 31 31
Rhode Island South Carolina South Dakota
31 31 31
Tennessee Texas Utah
31 31 31
Vermont Virginia Washington
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
baseline_year <- min(DONOHUE_DF$YEAR)
censoring_year <- max(DONOHUE_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
DONOHUE_DF <- DONOHUE_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
DONOHUE_DF <- DONOHUE_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(DONOHUE_DF$STATE))) Alaska Arizona Arkansas
31 31 31
California Colorado District of Columbia
31 31 31
Delaware Florida Georgia
31 31 31
Hawaii Idaho Illinois
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Jersey
31 31 31
New Mexico New York North Carolina
31 31 31
North Dakota Ohio Oklahoma
31 31 31
Oregon Pennsylvania Rhode Island
31 31 31
South Carolina South Dakota Tennessee
31 31 31
Texas Utah Virginia
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
[1] 45
LOTT_DF <- bind_rows(dem_LOTT,
ue_rate_data,
poverty_rate_data,
crime_data,
population_data,
ps_data) %>%
pivot_wider(names_from = "VARIABLE",
values_from = "VALUE") %>%
left_join(p_62 , by = c("STATE")) %>%
mutate(RTC_LAW = case_when(YEAR >= RTC_LAW_YEAR ~ TRUE,
TRUE ~ FALSE))
LOTT_DF %>%
group_by(YEAR) %>%
tally() %>%
filter(n != 51) %>%
print(n=dim(.)[1])# A tibble: 33 x 2
YEAR n
<dbl> <int>
1 1980 52
2 1981 52
3 1982 52
4 1983 52
5 1984 52
6 1985 52
7 1986 52
8 1987 52
9 1988 52
10 1989 52
11 1990 52
12 1991 52
13 1992 52
14 1993 52
15 1994 52
16 1995 52
17 1996 52
18 1997 52
19 1998 52
20 1999 52
21 2000 52
22 2001 52
23 2002 52
24 2003 52
25 2004 52
26 2005 52
27 2006 52
28 2007 52
29 2008 52
30 2009 52
31 2010 52
32 2011 52
33 2012 52
Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut D.C. Delaware
44 33 44
District of Columbia Florida Georgia
44 44 44
Hawaii Idaho Illinois
44 44 44
Indiana Iowa Kansas
44 44 44
Kentucky Louisiana Maine
44 44 44
Maryland Massachusetts Michigan
44 44 44
Minnesota Mississippi Missouri
44 44 44
Montana Nebraska Nevada
44 44 44
New Hampshire New Jersey New Mexico
44 44 44
New York North Carolina North Dakota
44 44 44
Ohio Oklahoma Oregon
44 44 44
Pennsylvania Rhode Island South Carolina
44 44 44
South Dakota Tennessee Texas
44 44 44
Utah Vermont Virginia
44 44 44
Washington West Virginia Wisconsin
44 44 44
Wyoming
44
[1] 44
LOTT_DF <- LOTT_DF %>%
mutate(STATE = fct_collapse(STATE, "District of Columbia"=c("District of Columbia","D.C.")))
summary(as.factor(LOTT_DF$STATE)) Alabama Alaska Arizona
44 44 44
Arkansas California Colorado
44 44 44
Connecticut District of Columbia Delaware
44 77 44
Florida Georgia Hawaii
44 44 44
Idaho Illinois Indiana
44 44 44
Iowa Kansas Kentucky
44 44 44
Louisiana Maine Maryland
44 44 44
Massachusetts Michigan Minnesota
44 44 44
Mississippi Missouri Montana
44 44 44
Nebraska Nevada New Hampshire
44 44 44
New Jersey New Mexico New York
44 44 44
North Carolina North Dakota Ohio
44 44 44
Oklahoma Oregon Pennsylvania
44 44 44
Rhode Island South Carolina South Dakota
44 44 44
Tennessee Texas Utah
44 44 44
Vermont Virginia Washington
44 44 44
West Virginia Wisconsin Wyoming
44 44 44
[1] 51
LOTT_DF <- LOTT_DF %>%
group_by(STATE, YEAR) %>%
summarise_all(~na.omit(unique(.))) %>%
ungroup() # This identifies unique observations, coalesces rows according to the grouping variable(s), and gets rid of of units that have incomplete data. This gives returns a dataframe with the most complete information.
summary(as.factor(LOTT_DF$STATE)) Alabama Alaska Arizona
31 31 31
Arkansas California Colorado
31 31 31
Connecticut District of Columbia Delaware
31 31 31
Florida Georgia Hawaii
31 31 31
Idaho Illinois Indiana
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Hampshire
31 31 31
New Jersey New Mexico New York
31 31 31
North Carolina North Dakota Ohio
31 31 31
Oklahoma Oregon Pennsylvania
31 31 31
Rhode Island South Carolina South Dakota
31 31 31
Tennessee Texas Utah
31 31 31
Vermont Virginia Washington
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
baseline_year <- min(LOTT_DF$YEAR)
censoring_year <- max(LOTT_DF$YEAR)
# Need to fix this to ensure severe bias is not introduced by prevalent "cases"
LOTT_DF <- LOTT_DF %>%
mutate(TIME_0 = baseline_year,
TIME_INF = censoring_year) %>%
filter(RTC_LAW_YEAR > TIME_0)
LOTT_DF <- LOTT_DF %>%
mutate(Viol_crime_rate_1k = (Viol_crime_count*1000)/Population,
Viol_crime_rate_1k_log = log(Viol_crime_rate_1k),
Population_log = log(Population))
summary(droplevels(as.factor(LOTT_DF$STATE))) Alaska Arizona Arkansas
31 31 31
California Colorado District of Columbia
31 31 31
Delaware Florida Georgia
31 31 31
Hawaii Idaho Illinois
31 31 31
Iowa Kansas Kentucky
31 31 31
Louisiana Maine Maryland
31 31 31
Massachusetts Michigan Minnesota
31 31 31
Mississippi Missouri Montana
31 31 31
Nebraska Nevada New Jersey
31 31 31
New Mexico New York North Carolina
31 31 31
North Dakota Ohio Oklahoma
31 31 31
Oregon Pennsylvania Rhode Island
31 31 31
South Carolina South Dakota Tennessee
31 31 31
Texas Utah Virginia
31 31 31
West Virginia Wisconsin Wyoming
31 31 31
[1] 45
STATE YEAR Black_Male_15_to_19_years
"factor" "numeric" "numeric"
Black_Male_20_to_39_years Other_Male_15_to_19_years Other_Male_20_to_39_years
"numeric" "numeric" "numeric"
White_Male_15_to_19_years White_Male_20_to_39_years Unemployment_rate
"numeric" "numeric" "numeric"
Poverty_rate Viol_crime_count Population
"numeric" "numeric" "numeric"
police_per_100k_lag RTC_LAW_YEAR RTC_LAW
"numeric" "numeric" "logical"
TIME_0 TIME_INF Viol_crime_rate_1k
"numeric" "numeric" "numeric"
Viol_crime_rate_1k_log Population_log
"numeric" "numeric"
DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log, color = STATE)) +
geom_point(size = 0.5) +
geom_line(aes(group=STATE),
size = 0.5,
show.legend = FALSE) +
geom_text_repel(data = DONOHUE_DF %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
filter(YEAR == last(YEAR)),
aes(label = STATE,
x = YEAR,
y = Viol_crime_rate_100k_log),
size = 3,
alpha = 1,
nudge_x = 10,
direction = "y",
hjust = 1,
vjust = 1,
segment.size = 0.25,
segment.alpha = 0.25,
force = 1,
max.iter = 9999) +
guides(color = FALSE) +
scale_x_continuous(breaks = seq(1980, 2015, by = 1),
limits = c(1980, 2015),
labels = c(seq(1980, 2010, by = 1), rep("", 5))) +
scale_y_continuous(breaks = seq(3.5, 8.5, by = 0.5),
limits = c(3.5, 8.5)) +
labs(title = "States have different levels of crime",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))DONOHUE_DF %>%
group_by(YEAR) %>%
summarise(Viol_crime_count = sum(Viol_crime_count),
Population = sum(Population),
.groups = "drop") %>%
mutate(Viol_crime_rate_100k_log = log((Viol_crime_count*100000)/Population)) %>%
ggplot(aes(x = YEAR, y = Viol_crime_rate_100k_log)) +
geom_line() +
scale_x_continuous(breaks = seq(1980, 2010, by = 1),
limits = c(1980, 2010),
labels = c(seq(1980, 2010, by = 1))) +
scale_y_continuous(breaks = seq(5.75, 6.75, by = 0.25),
limits = c(5.75, 6.75)) +
labs(title = "Crime rates fluctuate over time",
x = "Year",
y = "ln(violent crimes per 100,000 people)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
d_panel_DONOHUE <- pdata.frame(DONOHUE_DF, index=c("STATE", "YEAR"))
DONOHUE_OUTPUT <- plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
effect = "twoways",
model = "within",
data=d_panel_DONOHUE)
summary(DONOHUE_OUTPUT)Twoways effects Within Model
Call:
plm(formula = Viol_crime_rate_1k_log ~ RTC_LAW + White_Male_15_to_19_years +
White_Male_20_to_39_years + Black_Male_15_to_19_years + Black_Male_20_to_39_years +
Other_Male_15_to_19_years + Other_Male_20_to_39_years + Unemployment_rate +
Poverty_rate + Population_log + police_per_100k_lag, data = d_panel_DONOHUE,
effect = "twoways", model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.5716985 -0.0933827 0.0022014 0.0896372 1.0943035
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE 0.02306214 0.01666508 1.3839 0.1666372
White_Male_15_to_19_years -0.00263364 0.02732105 -0.0964 0.9232208
White_Male_20_to_39_years 0.04060493 0.00960488 4.2275 2.527e-05 ***
Black_Male_15_to_19_years -0.10348754 0.05695300 -1.8171 0.0694351 .
Black_Male_20_to_39_years 0.12003823 0.01938589 6.1920 7.934e-10 ***
Other_Male_15_to_19_years 0.66332029 0.11311115 5.8643 5.703e-09 ***
Other_Male_20_to_39_years -0.25380488 0.03938074 -6.4449 1.622e-10 ***
Unemployment_rate -0.01626228 0.00490799 -3.3134 0.0009468 ***
Poverty_rate -0.00890507 0.00295638 -3.0122 0.0026438 **
Population_log -0.22442622 0.06060682 -3.7030 0.0002219 ***
police_per_100k_lag 0.00047990 0.00013737 3.4935 0.0004926 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 36.917
R-Squared: 0.14565
Adj. R-Squared: 0.090168
F-statistic: 20.2864 on 11 and 1309 DF, p-value: < 2.22e-16
Some code taken from http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
LOTT_variables <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames()
LOTT_fmla <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables, collapse = " + ")
)
)
d_panel_LOTT <- pdata.frame(LOTT_DF, index=c("STATE", "YEAR"))
LOTT_OUTPUT <- plm(LOTT_fmla,
model = "within",
effect = "twoways",
data=d_panel_LOTT)
summary(LOTT_OUTPUT)Twoways effects Within Model
Call:
plm(formula = LOTT_fmla, data = d_panel_LOTT, effect = "twoways",
model = "within")
Balanced Panel: n = 45, T = 31, N = 1395
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.565457 -0.078747 0.001635 0.079232 0.577838
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
RTC_LAWTRUE -0.05422970 0.01658286 -3.2702 0.001103 **
White_Female_10_to_19_years 0.65063823 0.15149131 4.2949 1.880e-05 ***
White_Female_20_to_29_years -0.02997137 0.06349534 -0.4720 0.636990
White_Female_30_to_39_years 0.13132568 0.08104309 1.6204 0.105384
White_Female_40_to_49_years 0.09211246 0.08234849 1.1186 0.263534
White_Female_50_to_64_years -0.37475798 0.06335128 -5.9156 4.240e-09 ***
White_Female_65_years_and_over 0.20314547 0.04759664 4.2681 2.117e-05 ***
White_Male_10_to_19_years -0.61566593 0.14489592 -4.2490 2.303e-05 ***
White_Male_20_to_29_years 0.06466063 0.05901511 1.0957 0.273433
White_Male_30_to_39_years -0.10412806 0.08620494 -1.2079 0.227304
White_Male_40_to_49_years -0.21815118 0.07329480 -2.9764 0.002972 **
White_Male_50_to_64_years 0.38433281 0.07355515 5.2251 2.031e-07 ***
White_Male_65_years_and_over -0.21601720 0.06691569 -3.2282 0.001277 **
Black_Female_10_to_19_years -1.20662463 0.43502280 -2.7737 0.005623 **
Black_Female_20_to_29_years 0.02942780 0.17544389 0.1677 0.866820
Black_Female_30_to_39_years -0.15149500 0.20475568 -0.7399 0.459508
Black_Female_40_to_49_years 0.42380646 0.23611111 1.7949 0.072898 .
Black_Female_50_to_64_years 0.13802304 0.21419499 0.6444 0.519444
Black_Female_65_years_and_over -0.07820224 0.24128320 -0.3241 0.745908
Black_Male_10_to_19_years 1.36102016 0.44538035 3.0559 0.002291 **
Black_Male_20_to_29_years -0.14034048 0.18460396 -0.7602 0.447260
Black_Male_30_to_39_years 0.37937138 0.23590699 1.6081 0.108051
Black_Male_40_to_49_years -0.58107771 0.27188867 -2.1372 0.032772 *
Black_Male_50_to_64_years -0.25317586 0.24011393 -1.0544 0.291899
Black_Male_65_years_and_over -0.46825225 0.34645213 -1.3516 0.176754
Other_Female_10_to_19_years 0.57481127 0.49957581 1.1506 0.250112
Other_Female_20_to_29_years -1.12453492 0.27172673 -4.1385 3.725e-05 ***
Other_Female_30_to_39_years -3.15698149 0.35788016 -8.8213 < 2.2e-16 ***
Other_Female_40_to_49_years 0.96646809 0.42423419 2.2781 0.022882 *
Other_Female_50_to_64_years 2.97254960 0.34040734 8.7323 < 2.2e-16 ***
Other_Female_65_years_and_over 2.25872753 0.20551782 10.9904 < 2.2e-16 ***
Other_Male_10_to_19_years 0.24715044 0.48305107 0.5116 0.608988
Other_Male_20_to_29_years 1.58436219 0.25907190 6.1155 1.276e-09 ***
Other_Male_30_to_39_years 2.91519635 0.41689628 6.9926 4.336e-12 ***
Other_Male_40_to_49_years -1.22100778 0.44740943 -2.7291 0.006439 **
Other_Male_50_to_64_years -3.92082993 0.37595040 -10.4291 < 2.2e-16 ***
Other_Male_65_years_and_over -4.10090950 0.37041352 -11.0712 < 2.2e-16 ***
Unemployment_rate -0.00499765 0.00437844 -1.1414 0.253907
Poverty_rate -0.00571967 0.00254133 -2.2507 0.024576 *
Population_log -0.26192101 0.08472606 -3.0914 0.002035 **
police_per_100k_lag 0.00051043 0.00012220 4.1771 3.153e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Total Sum of Squares: 43.211
Residual Sum of Squares: 23.825
R-Squared: 0.44863
Adj. R-Squared: 0.39906
F-statistic: 25.3824 on 41 and 1279 DF, p-value: < 2.22e-16
comparing_analyses <- DONOHUE_OUTPUT_TIDY %>%
bind_rows(LOTT_OUTPUT_TIDY) %>%
filter(term == "RTC_LAWTRUE")
library(grid)
comparing_analyses_plot <- ggplot(comparing_analyses) +
geom_point(aes(x = Analysis, y = estimate)) +
geom_errorbar(aes(x = Analysis, ymin = conf.low, ymax = conf.high), width = 0.25) +
geom_hline(yintercept = 0, color = "red") +
scale_y_continuous(breaks = seq(-0.2, 0.2, by = 0.05),
labels = seq(-0.2, 0.2, by = 0.05),
limits = c(-0.2,0.2)) +
geom_segment(aes(x = 1, y = 0.125, xend = 1, yend = 0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "green",
lineend = "butt",
linejoin = "mitre") +
geom_segment(aes(x = 2, y = -0.125, xend = 2, yend = -0.175),
arrow = arrow(angle = 45, ends = "last", type = "open"),
size = 2,
color = "red",
lineend = "butt",
linejoin = "mitre") +
theme_minimal() +
theme(axis.title.x = element_blank(),
axis.text = element_text(size = 12)) +
labs(title = "Effect estimate on ln(violent crimes per 100,000 people)",
y = "Effect estimate (95% CI)")
comparing_analyses_plotHow did the above happen?
The analysis dataframes are very similar yet rendered very different results.
- different number of columns: 20 vs 50
[1] TRUE
The only difference between the two dataframes rests in how the demographic variables were parameterized.
[1] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[3] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[5] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[1] "Black_Female_10_to_19_years" "Black_Female_20_to_29_years"
[3] "Black_Female_30_to_39_years" "Black_Female_40_to_49_years"
[5] "Black_Female_50_to_64_years" "Black_Female_65_years_and_over"
[7] "Black_Male_10_to_19_years" "Black_Male_20_to_29_years"
[9] "Black_Male_30_to_39_years" "Black_Male_40_to_49_years"
[11] "Black_Male_50_to_64_years" "Black_Male_65_years_and_over"
[13] "Other_Female_10_to_19_years" "Other_Female_20_to_29_years"
[15] "Other_Female_30_to_39_years" "Other_Female_40_to_49_years"
[17] "Other_Female_50_to_64_years" "Other_Female_65_years_and_over"
[19] "Other_Male_10_to_19_years" "Other_Male_20_to_29_years"
[21] "Other_Male_30_to_39_years" "Other_Male_40_to_49_years"
[23] "Other_Male_50_to_64_years" "Other_Male_65_years_and_over"
[25] "White_Female_10_to_19_years" "White_Female_20_to_29_years"
[27] "White_Female_30_to_39_years" "White_Female_40_to_49_years"
[29] "White_Female_50_to_64_years" "White_Female_65_years_and_over"
[31] "White_Male_10_to_19_years" "White_Male_20_to_29_years"
[33] "White_Male_30_to_39_years" "White_Male_40_to_49_years"
[35] "White_Male_50_to_64_years" "White_Male_65_years_and_over"
Clearly, this had an effect on the results of the analysis.
Let’s explore how this occured.
When seemingly independent variables are highly related to one another, the relationships estimated in an analysis may be distorted.
In regression analysis, this distortion is often a byproduct of a violation of the independence assumption. This distortion, if large enough, can impact statistical inference.
There are several ways we can diagnose multicollinearity.
Again, multicollinearity often occurs when independent variables are highly related to one another. Consequently, we can evaluate these relationships be examining the correlation between variable pairs.
It is important to note that multicollinearity and correlation are not one and the same. Correlation can be thought of as the strength of the relationship between variables. On the other hand, multicollinearity can be thought of the the violation of the independence assumption that is a consequence of this correlation in a regression analysis.
[1] "STATE" "YEAR"
[3] "Black_Male_15_to_19_years" "Black_Male_20_to_39_years"
[5] "Other_Male_15_to_19_years" "Other_Male_20_to_39_years"
[7] "White_Male_15_to_19_years" "White_Male_20_to_39_years"
[9] "Unemployment_rate" "Poverty_rate"
[11] "Viol_crime_count" "Population"
[13] "police_per_100k_lag" "RTC_LAW_YEAR"
[15] "RTC_LAW" "TIME_0"
[17] "TIME_INF" "Viol_crime_rate_1k"
[19] "Viol_crime_rate_1k_log" "Population_log"
DONOHUE_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))LOTT_DF %>%
dplyr::select(RTC_LAW,
Viol_crime_rate_1k_log,
Unemployment_rate,
Poverty_rate,
Population_log) %>%
ggpairs(.,
columns = c(2:5),
lower = list(continuous = wrap("smooth_loess",
color = "red",
alpha = 0.5,
size = 0.1)))cor_DONOHUE_dem <- cor(DONOHUE_DF %>% dplyr::select(contains("_years")))
corr_mat_DONOHUE <- ggcorrplot(cor_DONOHUE_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 1",
legend.title = TeX("$\\rho$"))
corr_mat_DONOHUEcor_LOTT_dem <- cor(LOTT_DF %>% dplyr::select(contains("_years")))
corr_mat_LOTT <- ggcorrplot(cor_LOTT_dem,
tl.cex = 6,
hc.order = TRUE,
colors = c("red",
"white",
"red"),
outline.color = "transparent",
title = "Correlation Matrix, Analysis 2",
legend.title = TeX("$\\rho$"))
corr_mat_LOTTsims <- 250
# DONOHUE
# round(dim(DONOHUE_DF)[1]/2)
samps_DONOHUE <- lapply(rep(dim(DONOHUE_DF)[1]-1, sims),
function(x)DONOHUE_DF[sample(nrow(DONOHUE_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_DONOHUE <- function(split){
plm(Viol_crime_rate_1k_log ~
RTC_LAW +
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_DONOHUE <- lapply(samps_DONOHUE, fit_nls_on_bootstrap_DONOHUE)
samps_models_DONOHUE <- samps_models_DONOHUE %>%
map(tidy)
names(samps_models_DONOHUE) <- paste0("DONOHUE_",1:length(samps_models_DONOHUE))
simulations_DONOHUE <- samps_models_DONOHUE %>%
bind_rows(.id = "ID") %>%
mutate(Analysis = "Analysis 1")
## LOTT
samps_LOTT <- lapply(rep(round(dim(LOTT_DF)[1]/2), sims),
function(x) LOTT_DF[sample(nrow(LOTT_DF),
size = x, replace = FALSE),])
fit_nls_on_bootstrap_LOTT <- function(split){
plm(LOTT_fmla,
data = data.frame(split),
index = c("STATE","YEAR"),
model = "within",
effect = "twoways")
}
samps_models_LOTT <- lapply(samps_LOTT, fit_nls_on_bootstrap_LOTT)
samps_models_LOTT <- samps_models_LOTT %>%
map(tidy)
names(samps_models_LOTT) <- paste0("LOTT_",1:length(samps_models_LOTT))
simulations_LOTT <- samps_models_LOTT %>%
bind_rows(.id = "Analysis") %>%
mutate(Analysis = "Analysis 2")
simulations <- bind_rows(simulations_DONOHUE,
simulations_LOTT)
simulation_plot <- simulations %>%
filter(term=="RTC_LAWTRUE") %>%
ggplot(aes(x = Analysis, y = estimate)) +
geom_jitter(alpha = 0.25,
width = 0.1) +
labs(title = "Coefficient instability",
subtitle = "Estimates sensitive to observation deletions",
x = "Term",
y = "Coefficient",
caption = "Results from simulations") +
theme_minimal() +
theme(axis.title.x = element_blank())
simulation_plotdesign.matrix <- as.data.frame(model.matrix(DONOHUE_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_DONOHUE$Viol_crime_rate_1k_log)
lm_DONOHUE <- lm(Viol_crime_rate_1k_log ~
RTC_LAWTRUE + # logical class changes variable name after inital model
White_Male_15_to_19_years +
White_Male_20_to_39_years +
Black_Male_15_to_19_years +
Black_Male_20_to_39_years +
Other_Male_15_to_19_years +
Other_Male_20_to_39_years +
Unemployment_rate +
Poverty_rate +
Population_log +
police_per_100k_lag,
data = design.matrix)
vif(lm_DONOHUE) RTC_LAWTRUE White_Male_15_to_19_years White_Male_20_to_39_years
1.095268 1.172703 1.685342
Black_Male_15_to_19_years Black_Male_20_to_39_years Other_Male_15_to_19_years
1.313339 1.656860 1.574839
Other_Male_20_to_39_years Unemployment_rate Poverty_rate
1.623750 1.242150 1.262682
Population_log police_per_100k_lag
1.154825 1.163058
vif_DONOHUE <- vif(lm_DONOHUE)
vif_DONOHUE <- vif_DONOHUE %>%
as_tibble() %>%
cbind(., names(vif_DONOHUE)) %>%
as_tibble()
colnames(vif_DONOHUE) <- c("VIF", "Variable")
max_vif_DONOHUE <- max(vif(lm_DONOHUE)) design.matrix <- as.data.frame(model.matrix(LOTT_OUTPUT))
design.matrix$Viol_crime_rate_1k_log <- plm::Within(
d_panel_LOTT$Viol_crime_rate_1k_log)
LOTT_variables_ols <- LOTT_DF %>%
dplyr::select(RTC_LAW,
contains(c("White","Black","Other")),
Unemployment_rate,
Poverty_rate,
Population_log,
police_per_100k_lag) %>%
colnames() %>%
str_replace("RTC_LAW", "RTC_LAWTRUE") # logical class changes variable name after inital model
LOTT_fmla_ols <- as.formula(paste("Viol_crime_rate_1k_log ~",
paste(LOTT_variables_ols, collapse = " + ")
)
)
lm_LOTT <- lm(LOTT_fmla_ols,
data = design.matrix)
vif(lm_LOTT) RTC_LAWTRUE White_Female_10_to_19_years
1.641916 127.733327
White_Female_20_to_29_years White_Female_30_to_39_years
42.184712 49.980961
White_Female_40_to_49_years White_Female_50_to_64_years
37.856684 36.547007
White_Female_65_years_and_over White_Male_10_to_19_years
12.863500 126.795600
White_Male_20_to_29_years White_Male_30_to_39_years
39.477520 73.002593
White_Male_40_to_49_years White_Male_50_to_64_years
31.686738 52.974511
White_Male_65_years_and_over Black_Female_10_to_19_years
13.311999 330.982883
Black_Female_20_to_29_years Black_Female_30_to_39_years
106.841150 78.289450
Black_Female_40_to_49_years Black_Female_50_to_64_years
98.421635 66.551531
Black_Female_65_years_and_over Black_Male_10_to_19_years
48.358137 318.032781
Black_Male_20_to_29_years Black_Male_30_to_39_years
89.283293 87.978290
Black_Male_40_to_49_years Black_Male_50_to_64_years
91.913602 64.235719
Black_Male_65_years_and_over Other_Female_10_to_19_years
37.575659 143.610940
Other_Female_20_to_29_years Other_Female_30_to_39_years
65.320481 55.395405
Other_Female_40_to_49_years Other_Female_50_to_64_years
222.043147 132.105354
Other_Female_65_years_and_over Other_Male_10_to_19_years
82.816114 153.770551
Other_Male_20_to_29_years Other_Male_30_to_39_years
54.915467 63.326933
Other_Male_40_to_49_years Other_Male_50_to_64_years
241.793319 174.690518
Other_Male_65_years_and_over Unemployment_rate
53.654443 1.496689
Poverty_rate Population_log
1.412607 3.416913
police_per_100k_lag
1.393440
vif_LOTT <- vif(lm_LOTT)
vif_LOTT <- vif_LOTT %>%
as_tibble() %>%
cbind(., names(vif_LOTT)) %>%
as_tibble()
colnames(vif_LOTT) <- c("VIF", "Variable")
max_vif_LOTT <- max(vif(lm_LOTT))[1] 1.685342
[1] 330.9829
\[\frac{1}{1-R_{i}^{2}}\]
vif_DONOHUE$Analysis <- "Analysis 1"
vif_LOTT$Analysis <- "Analysis 2"
vif_df <- rbind(vif_DONOHUE,
vif_LOTT)
vif_plot <- vif_df %>%
ggplot(aes(x = Analysis, y = VIF)) +
geom_jitter(width = 0.1, alpha = 0.5, size = 2) +
geom_hline(yintercept = 10, color = "red") +
scale_y_continuous(trans = 'log10',
limits = c(1,1000)) +
labs(title = "Variance inflation factors") +
theme_minimal() +
theme(axis.title.x = element_blank())
vif_plothttps://rpubs.com/rslbliss/fixed_effects
http://karthur.org/2019/implementing-fixed-effects-panel-models-in-r.html
https://stats.stackexchange.com/questions/99236/effects-in-panel-models-individual-time-or-twoways